Gapminder Dataset Analysis

In this notebook, data from the Gapminder project is analyzed and visualized.

The main focus is the relationship between GDP and CO2 emissions. In addition, the relationship between continent and energy use is explored, and questions about imports of goods and services, population density and life expectancy are answered with a focus on specific geographic locations and time periods.

The Data

Feel free to explore the data in the interactive dataframe. (Unfortunately, I have found no way to keep the ipython widget callback functions working outside of the saved state. I am leaving the widget in anyway, in case you know more than I do and can tell me a trick to keep it functional here.)

Relationship between the CO2 emission and GDP per capita in 1962

Can a relationship between CO2 emissions and GDP per capita in 1962 be identified when comparing data from different countries?

To answer that question, all data from 1962 is extracted from the dataset.
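The extraction step could look roughly like the following sketch. The toy table and the column names `Year`, `Country Name`, `gdpPercap` and `CO2 emissions (metric tons per capita)` are assumptions for illustration and may not match the actual dataset.

```python
import pandas as pd

# Toy stand-in for the Gapminder table; values and column names are assumed.
df = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Year": [1962, 1967, 1962, 1967],
    "gdpPercap": [1000.0, 1200.0, 5000.0, 5600.0],
    "CO2 emissions (metric tons per capita)": [0.5, 0.6, 4.0, 4.4],
})

# Keep only the 1962 rows and drop rows missing either variable of interest.
cols = ["gdpPercap", "CO2 emissions (metric tons per capita)"]
data_1962 = df[df["Year"] == 1962].dropna(subset=cols)

print(data_1962)
```

Dropping incomplete rows up front keeps the later scatter plot and correlation based on the same set of countries.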

Scatter Plot

In 1962 there was a clear positive relationship between carbon dioxide emissions and GDP per capita. As many carbon dioxide emitting technologies, such as cars, were a lot more costly at that time, it makes sense that citizens of higher income countries were more likely to be able to afford them. The plot clearly shows that the majority of the carbon producers were high income European, North American and Oceanian countries, along with some Gulf countries.

Today the situation is different: carbon dioxide emitting technologies are widely available and more affordable. In addition, environmentally friendly technologies have emerged in recent years, along with a growing awareness of the dangers of carbon dioxide emissions. Many high income countries have invested heavily in carbon reducing technologies and are more likely to use environmentally friendly alternatives, so a trend reversal seems likely for the same plot with 21st century data.

Pearson Correlation

In fact, closer inspection shows that these two parameters are strongly correlated in 1962.
The data shows a Pearson correlation of R=0.93 with a p-value of p << 0.001 (p=1.13e-46), which means the correlation is highly significant.
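As a minimal sketch of how such a coefficient and p-value can be obtained, `scipy.stats.pearsonr` returns both. The numbers below are invented toy values, not the Gapminder data, so the result will not match the R=0.93 reported above.

```python
from scipy import stats

# Invented toy values standing in for GDP per capita and CO2 emissions.
gdp = [1000.0, 2500.0, 4000.0, 8000.0, 12000.0]
co2 = [0.4, 1.1, 2.0, 3.9, 6.5]

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p = stats.pearsonr(gdp, co2)

print(f"R={r:.2f}, p={p:.2e}")
```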

The correlation between these two parameters peaked in 1967 at R=0.94 (p=3.4e-53), after which it (as previously hypothesized) decreased rapidly until it reached a low of R=0.72 (p=9.2e-22) in 2007. The dataset ends in 2007, but it can safely be assumed that the correlation has decreased drastically further in the past 14 years.

Correlation of the parameters per year (1962 - 2007)
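One way to compute the correlation separately for each year is to group the table by year, as in this sketch. The column names `Year`, `gdpPercap` and `co2` and all values are hypothetical.

```python
import pandas as pd

# Toy data for two years; names and values are assumptions for illustration.
df = pd.DataFrame({
    "Year": [1962] * 4 + [2007] * 4,
    "gdpPercap": [1000, 2000, 3000, 4000, 1000, 2000, 3000, 4000],
    "co2": [1.0, 2.0, 3.0, 4.0, 2.0, 1.0, 4.0, 3.0],
})

# Pearson correlation between the two columns, computed within each year.
corr_by_year = pd.Series(
    {year: group["gdpPercap"].corr(group["co2"]) for year, group in df.groupby("Year")}
)

print(corr_by_year)
```

The resulting series is indexed by year, so it can be passed directly to a line plot.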

What is the relationship between continent and 'Energy use (kg of oil equivalent per capita)'?

Can a relationship between continent and energy use be identified for 1967?

To answer that question, the data is once more extracted from the dataset (this time from 1967).

As the relationship between a categorical (continent) and a continuous (energy use) parameter is being investigated and more than 2 groups are being compared, an ANOVA test would be most useful. However, several assumptions are made when conducting an ANOVA test, including independent observations, approximately normally distributed values within each group, and equal variances across the groups.

Testing ANOVA assumptions

In order to decide which statistical test (ANOVA vs. Kruskal-Wallis) to use, it needs to be tested whether the assumptions necessary for an ANOVA test are met.
Levene's test is used to verify that the variances of the groups are equal.
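A sketch of the equal-variance check with `scipy.stats.levene` follows. The three groups hold invented energy-use values, and the 0.05 threshold is an assumed convention rather than something stated in this notebook.

```python
from scipy import stats

# Invented energy-use values for three continents (not the real data).
europe = [2000.0, 2500.0, 3000.0, 3500.0, 4000.0]
africa = [300.0, 320.0, 310.0, 330.0, 305.0]
asia = [400.0, 900.0, 1500.0, 350.0, 600.0]

# Levene's test: the null hypothesis is that all groups have equal variance.
stat, p = stats.levene(europe, africa, asia)

# Assumed significance level of 0.05; a small p-value rejects equal variances.
equal_variances = p >= 0.05
print(f"W={stat:.2f}, p={p:.3f}, equal variances plausible: {equal_variances}")
```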

As at least one assumption required to conduct an ANOVA test is not met, a Kruskal-Wallis test is conducted instead.
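The Kruskal-Wallis test compares the groups by rank and is available as `scipy.stats.kruskal`. The sketch below uses invented values with completely separated groups, so it only illustrates the call, not the result for the 1967 data.

```python
from scipy import stats

# Invented, fully separated toy groups (not the real data).
groups = {
    "Africa": [300.0, 310.0, 320.0, 330.0, 340.0],
    "Asia": [400.0, 500.0, 600.0, 700.0, 800.0],
    "Europe": [2000.0, 2500.0, 3000.0, 3500.0, 4000.0],
}

# kruskal takes one sequence per group and returns the H statistic and p-value.
h, p = stats.kruskal(*groups.values())

print(f"H={h:.2f}, p={p:.4f}")
```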

The Kruskal-Wallis test on the 1967 data shows that energy consumption varies significantly between continents.
A Tukey's test will bring more clarity about which groups differ from each other:
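A pairwise comparison of this kind could be sketched with `scipy.stats.tukey_hsd`, which requires SciPy 1.8 or newer. Whether the original analysis used this function or another implementation is not known, and the groups below are invented toy values.

```python
from scipy import stats

# Invented toy groups (not the real data), in the order Africa, Asia, Europe.
africa = [300.0, 310.0, 320.0, 330.0, 340.0]
asia = [400.0, 500.0, 600.0, 700.0, 800.0]
europe = [2000.0, 2500.0, 3000.0, 3500.0, 4000.0]

# tukey_hsd compares every pair of groups; pvalue[i, j] belongs to pair (i, j).
res = stats.tukey_hsd(africa, asia, europe)

names = ["Africa", "Asia", "Europe"]
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: p={res.pvalue[i, j]:.4f}")
```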

Based on Tukey's test, energy consumption differs between almost all pairs of continents, with two exceptions.
Energy consumption did not differ between the Americas and Asia, nor between Europe and Oceania.

Is there a significant difference between Europe and Asia with respect to 'Imports of goods and services (% of GDP)' in the years after 1990?

In order to answer this question, two subsets of data have to be created and compared to each other:
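The two subsets and their comparison could be built as in this sketch. The column names `continent` and `Year` are assumptions, the values are invented, and Welch's variant (`equal_var=False`) is an assumed choice, since the exact t-test used is not stated here.

```python
import pandas as pd
from scipy import stats

col = "Imports of goods and services (% of GDP)"

# Toy data; "continent" and "Year" column names and all values are assumed.
df = pd.DataFrame({
    "continent": ["Europe"] * 4 + ["Asia"] * 4,
    "Year": [1987, 1992, 1997, 2002, 1987, 1992, 1997, 2002],
    col: [30.0, 32.0, 35.0, 38.0, 25.0, 28.0, 40.0, 45.0],
})

# Restrict to years after 1990, then split into one subset per continent.
recent = df[df["Year"] > 1990]
europe = recent.loc[recent["continent"] == "Europe", col].dropna()
asia = recent.loc[recent["continent"] == "Asia", col].dropna()

# Two-sample t-test; equal_var=False selects Welch's t-test (assumed choice).
t, p = stats.ttest_ind(europe, asia, equal_var=False)
print(f"t={t:.2f}, p={p:.3f}")
```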

Based on the results of the t-test, there was no significant difference between Europe and Asia in terms of imports of goods and services measured as a % of GDP. Both continents have countries that are strong exporters and countries that are strong importers, which seem to balance each other out, so that the net imports of the two continents are comparable.
Comparing individual countries would make the results differ to a much larger extent.

There is one country that is clearly an outlier: Singapore.

What is the country (or countries) that has the highest 'Population density (people per sq. km of land area)' across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

In order to answer this question, the dataset needs to be divided into one subset per year. In a second step, the country with the highest value in the column "Population density (people per sq. km of land area)" needs to be identified.
The results are as follows:
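A per-year lookup of this kind can be sketched with a group-wise `idxmax`. The `Country Name` and `Year` column names are assumptions, and the placeholder countries and values are invented.

```python
import pandas as pd

col = "Population density (people per sq. km of land area)"

# Toy data with placeholder countries; names and values are assumed.
df = pd.DataFrame({
    "Country Name": ["A", "B", "A", "B"],
    "Year": [1962, 1962, 1972, 1972],
    col: [11000.0, 9000.0, 12000.0, 13000.0],
})

# For each year, find the row index of the maximum density, then select those rows.
top_rows = df.groupby("Year")[col].idxmax()
top_per_year = df.loc[top_rows, ["Year", "Country Name", col]]

print(top_per_year)
```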

Monaco (1962, 1967, 1977, 1982 and 2007) and the autonomous Chinese region Macau (1972, 1987, 1992, 1997 and 2002) take turns as the country with the highest population density. Both are microstates or city-states, which raises their population density: they have the density of a typical city in most countries without having any large rural areas.

What country (or countries) has shown the greatest increase in 'Life expectancy at birth, total (years)' since 1962?

To answer this question, the increase from 1962 to 2007 has to be calculated for every country. Then, in a second step, these calculated values need to be compared to each other and the maximum value needs to be identified.
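The two steps described above can be sketched by pivoting the table so that each year becomes a column. The `Country Name` and `Year` column names and all values are assumptions for illustration.

```python
import pandas as pd

col = "Life expectancy at birth, total (years)"

# Toy data with placeholder countries; names and values are assumed.
df = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B", "C", "C"],
    "Year": [1962, 2007] * 3,
    col: [40.0, 70.0, 65.0, 78.0, 50.0, 60.0],
})

# Step 1: one row per country, one column per year, then take the difference.
wide = df.pivot(index="Country Name", columns="Year", values=col)
increase = wide[2007] - wide[1962]

# Step 2: sort so the largest increases come first.
increase = increase.sort_values(ascending=False)
print(increase.head(3))
```

Countries missing a value in either year end up as NaN in the difference and are sorted to the end.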

The Maldives (+37y), Bhutan (+33y) and Timor-Leste (+31y) had the highest increases in life expectancy in the time frame from 1962 to 2007.
All three of these countries had disproportionately low life expectancies in 1962, which left room for extreme increases in the following years.